Abstract. The ever-increasing amount of data in biomedical research, and in cancer
research in particular, needs to be managed to support efficient data access, exchange
and integration. Existing software infrastructures, such as caGrid, support access to
distributed information annotated with a domain ontology. However, caGrid's current
querying functionality depends on the structure of individual data resources without
exploiting the semantic annotations. In this paper, we present the design and
development of an ontology-based querying functionality that consists of: the generation
of OWL2 ontologies from the underlying data resources' metadata, and a query rewriting
and translation process based on reasoning, which converts a query at the domain
ontology level into queries at the software infrastructure level. We present a detailed
analysis of our approach as well as an extensive performance evaluation. While the
implementation and evaluation were performed for the caGrid infrastructure, the approach
could be applicable to other model- and metadata-driven environments for data sharing.

Keywords: ontology, query, caGrid, UML, OWL2, sequence pattern

1 Introduction

In the biomedical sciences, the use, exchange and integration of the ever-increasing
amount of data has become paramount to accelerate the discovery of new approaches for
the detection, diagnosis, treatment and prevention of diseases. In particular, this
applies to cancer, for which the US National Cancer Institute (NCI) and the UK National
Cancer Research Institute (NCRI) have implemented the caBIG® programme and the NCRI
Informatics Initiative, which aim to build and deploy software infrastructure to manage
and analyse data generated from heterogeneous data sources.

In this paper, we provide an analysis of the caGrid software infrastructure developed
within the NCI caBIG® programme and extend it with richer querying capabilities. caGrid
supports a collaborative information network for sharing cancer research data, and deals
with syntactic and semantic interoperability of the data resources in a service-oriented
model-driven architecture. Semantic interoperability is achieved by using a metadata
registry, which maintains information models annotated with concepts from a domain
ontology: the NCI thesaurus (NCIt). However, the query functionality provided in caGrid
does not take the semantic annotations into account; it relies only on each individual
information model.

caBIG® stands for cancer Biomedical Informatics Grid®.

Our methodology is based on extending the caGrid service-oriented model-driven
infrastructure with additional services to support ontology-based queries over the
distributed data resources. In this way, biomedical researchers, as the end-users of
our system, will be able to query cancer data by building queries using their domain
knowledge (expressed as concepts from the NCIt ontology) rather than having to know the
underlying models. This also means that the queries are reusable across resources, which
is not the case in the caGrid infrastructure. This functionality will be incorporated
into the NCRI ONcology Information eXchange (ONIX). Our approach involves a customised
transformation from annotated information models to an ontological representation using
the Web Ontology Language version 2 (OWL2). This representation supports annotations
based on a primary concept and a list of qualifiers. Based on these ontological
representations of the data resources, we have designed and developed a query rewriting
and translation approach that converts concept-based queries into the query language
supported by the caGrid infrastructure. This approach is general and could be used to
support other target query languages, as the only step dependent on caGrid is the last
one. This work improves on our previous work [3]: we have substantially revised the OWL
representation and the design and implementation of the query rewriting and translation
steps. We have developed a caGrid analytical service for the transformation from an
annotated information model to OWL. Additionally, we present an analysis of the caGrid
query language and information models, together with an extensive performance evaluation
that justifies the applicability of our solution.

This paper is structured as follows. Section 2 introduces background material on the
caGrid infrastructure. Section 3 presents an analysis of the caGrid query functionality
and the type of queries supported by its query language. Then, we present in Section 4.1
the OWL representation that is used for query rewriting and translation, which in turn
is described in Section 4.2. The implementation details and performance evaluation
results are given in Sections 4.3 and 4.4, respectively. The evaluation includes an
analysis of the generated ontologies as well as several performance metrics for OWL
generation and query rewriting, which justify the viability of our approach. After
comparing our approach with related work in Section 5, we conclude the paper in
Section 6, including considerations for future work.



http://www.ncri-onix.org.uk/

OWL is a recommendation from the World Wide Web Consortium (W3C) and the language
overview for its second version can be found at http://www.w3.org/TR/owl2-overview/

2 Background

caBIG® is an NCI programme whose aim is to create a virtual and federated informatics
infrastructure for sharing data and tools and for connecting scientists and
organisations in the cancer research community. The computing middleware in caBIG® is
called caGrid, which is a Grid [1] extended to support data modelling and semantics.
caGrid has a number of core services and corresponding application programming
interfaces (APIs), which we introduce next by analogy with the metadata hierarchy [5],
as per Figure 1.

Fig. 1: caGrid semantic infrastructure.

The metadata hierarchy represents how the semantics of raw data (instance data) can be
augmented by overlaying metadata of increasing descriptiveness [5]. The syntactic
metadata refers to the language format and data types, and in caGrid is represented by
XML schemas managed by the Global Model Exchange (GME) service, which exposes them
through the GME API. The structural metadata gives form to the units of data. In caGrid,
it is implemented as an object-oriented virtualisation of the underlying data
resources [1] and it is represented as UML models. These UML models can be accessed
through the Discovery API. The purpose of the referent metadata is to represent the
linkages between the different data models. In caGrid, the linkages are provided by a
metadata registry, called caDSR, based on the ISO/IEC 11179 standard. caDSR manages
common data elements (CDEs) and exposes them through the caDSR API. The domain metadata
represents what the data is about. It is implemented by a domain conceptualisation,
usually in the form of an ontology [5]. In the caGrid case, the NCIt ontology [2] is
used, accessed via the LexEVS API. Finally, the rules constitute an overarching layer
that can be applied to all the aforementioned layers. Rules can be used to constrain and
extend the semantics of metadata specifications at any of the abstraction levels [5]. In
the current caGrid semantic infrastructure, there is no equivalent to the rule metadata.

A data resource owner can share the data by developing caGrid data services using
common interfaces and metadata, as described above. In this way, a data service
encapsulates the data, which is kept in native formats (including, for example,
relational data or flat files), exposing an access interface based on the
object-oriented (UML) model of the underlying resource. The common interface also
exposes a query processor based on the Common Query Language (CQL) defined for caGrid.
CQL is an object-oriented query language reflecting the underlying object model of the
data resource while abstracting the physical representation of the data [1]. At the time
of writing, there exist two versions of CQL and there is a pre-release version of the
latest one. More details on CQL are given in Section 3.

http://cagrid.org/display/gme/

UML stands for the Unified Modeling Language, a specification of the Object Management
Group® (OMG®)

http://cagrid.org/display/metadata13/Discovery

caDSR stands for cancer Data Standards Repository

http://metadata-stds.org/11179/

https://cabig.nci.nih.gov/concepts/EVS/

caGrid also supports basic distributed aggregations and joins of queries over multiple
data services by means of the caGrid Federated Query Infrastructure, through a
distributed extension of CQL called DCQL. Thus, caGrid relies on D/CQL, custom query
languages based on the structural characteristics of the resources. In other words,
caGrid builds a 'structural layer', where queries are expressed in terms of objects,
attributes and associated objects, without allowing for semantic queries. D/CQL are
evolving to provide richer structural queries as new requirements arise from different
caBIG® projects. However, these query languages do not allow for data extraction based
on semantic information. Thus, a shortcoming of caGrid is that it does not currently
exploit the referent and domain metadata maintained for its data services. Additionally,
as already mentioned, it is not possible to specify rules about the models or the domain
semantics.

As stated in the introduction, this work advocates extending the caGrid infrastructure
with a semantic layer that uses semantic web technologies to exploit caGrid's rich
metadata. Additionally, this extension is capable of: a) accommodating other resources
with different ways of dealing with metadata, and b) specifying rules at different
levels of abstraction.



3 Analysis of the caGrid Query Language 

A CQL query is defined by an XML document, which must comply with a specified XML
schema. The schema indicates that a CQL query must specify a ⟨Target⟩ element, which is
the data type of the query result. Optionally, an ⟨Attribute⟩ element may indicate a
predicate over an attribute of the object of ⟨Target⟩ type and an ⟨Association⟩ may
specify a link with a related object. In Table 1, we show how a CQL query is built
recursively, presenting it as a context-free grammar, where ⟨CQLQuery⟩ is the start
symbol, ε is the empty string and ⟨xsd:string⟩ is the non-terminal representing the
xsd:string data type.

So, CQL traverses the UML class diagram graph, where the ⟨Target⟩ is the initial class,
the ⟨Association⟩ conditions allow for path navigation by traversing sequences of
consecutive classes and ⟨Attribute⟩ conditions apply locally to individual classes. The
non-terminal symbols ⟨Group⟩ and ⟨Group1⟩ represent the combination of two or more
constraints over a particular node in the UML class graph.
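The recursive shape of a CQL query can be illustrated with a small structural check. The
following is a minimal sketch, not part of caGrid: it assumes a hypothetical in-memory
form of a query as nested Python dictionaries (the keys target, kind, roleName,
constraint, op and members are invented for illustration) and only uses the predicate
and logical operator names of the grammar. It treats the constraint under a target as
optional, which is looser than the grammar.

```python
# Hypothetical in-memory form of a CQL query: nested dicts whose "kind" is
# "attribute", "association" or "group". None of these key names come from caGrid.
PREDICATES = {
    "EQUAL_TO", "NOT_EQUAL_TO", "LIKE", "IS_NULL", "IS_NOT_NULL",
    "LESS_THAN", "LESS_THAN_EQUAL_TO", "GREATER_THAN", "GREATER_THAN_EQUAL_TO",
}


def valid_constraint(node):
    """Check an attribute, association or group constraint recursively."""
    kind = node.get("kind")
    if kind == "attribute":
        # Attribute -> Name Predicate Value
        return (isinstance(node.get("name"), str)
                and node.get("predicate") in PREDICATES
                and isinstance(node.get("value"), str))
    if kind == "association":
        # An association names a role and may carry one nested constraint.
        child = node.get("constraint")
        return isinstance(node.get("roleName"), str) and (
            child is None or valid_constraint(child))
    if kind == "group":
        # A group combines two or more constraints under AND / OR.
        members = node.get("members", [])
        return (node.get("op") in {"AND", "OR"} and len(members) >= 2
                and all(valid_constraint(m) for m in members))
    return False


def valid_query(query):
    """A query is a target class name plus an optional constraint (assumption)."""
    child = query.get("constraint")
    return isinstance(query.get("target"), str) and (
        child is None or valid_constraint(child))


example = {
    "target": "SNP",
    "constraint": {
        "kind": "association", "roleName": "gene",
        "constraint": {"kind": "attribute", "name": "symbol",
                       "predicate": "EQUAL_TO", "value": "TGFB1"},
    },
}
print(valid_query(example))
```

Each nested association corresponds to one step of path navigation from the target
class, while attribute constraints stay local to the class they are attached to.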

⟨CQLQuery⟩      → ⟨Target⟩ | ⟨Target⟩ ⟨QueryModifier⟩
⟨Target⟩        → ⟨Name⟩ ⟨Attribute⟩ | ⟨Name⟩ ⟨Association⟩ | ⟨Name⟩ ⟨Group⟩
⟨Attribute⟩     → ⟨Name⟩ ⟨Predicate⟩ ⟨Value⟩
⟨Group⟩         → ⟨LogicalOp⟩ ⟨Attribute⟩ ⟨Group1⟩ | ⟨LogicalOp⟩ ⟨Association⟩ ⟨Group1⟩
⟨Group1⟩        → ⟨Attribute⟩ ⟨Group1⟩ | ⟨Association⟩ ⟨Group1⟩ | ⟨Group⟩ | ε
⟨Association⟩   → ⟨RoleName⟩ | ⟨RoleName⟩ ⟨Association⟩ | ⟨RoleName⟩ ⟨Attribute⟩ |
                  ⟨RoleName⟩ ⟨Group⟩
⟨QueryModifier⟩ → ⟨DistinctAttribute⟩ | ⟨DistinctAttribute⟩ ⟨AttributeN
⟨Name⟩          → ⟨xsd:string⟩
⟨RoleName⟩      → ⟨xsd:string⟩
⟨Value⟩         → ⟨xsd:string⟩
⟨LogicalOp⟩     → AND | OR
⟨Predicate⟩     → EQUAL_TO | NOT_EQUAL_TO | LIKE | IS_NULL | IS_NOT_NULL | LESS_THAN |
                  LESS_THAN_EQUAL_TO | GREATER_THAN | GREATER_THAN_EQUAL_TO

Table 1: CQL query context-free grammar

4 Ontology-based queries over the caGrid infrastructure

As shown before, the caGrid queries rely on the structure of the underlying data
resources, i.e. their UML models. Thus, a biomedical researcher interested in retrieving
data about, for example, a particular gene of interest will need to explore the UML
model of each relevant data service and build a query considering the specific
attributes and associations of the class maintaining the Gene objects. The queries can
be built programmatically or through the caGrid portal, which allows the user to explore
the UML models and provides a query builder based on these models.

http://cagrid.org/display/dataservices/CQL+2

http://cagrid.org/display/fqp/Home

The CQL XML schema is available at: http://cagrid.org/display/dataservices/CQL+Schemas

In this work, we propose a system that allows the user to concentrate on the concepts
from the domain, as represented by the NCIt ontology on cancer, and build ontology-based
queries that are high-level and descriptive. Thus, the ontology-based queries can be
applied to any of the underlying data resources.

Apart from the cancer concepts found in NCIt, the queries combine elements from an
ontology we built with metadata on UML models, namely the UML model ontology. This
ontology contains OWL classes to represent UML classes and attributes (UMLClass,
UMLAttribute), OWL object properties to represent UML associations and the relationship
between a UML class and its attributes (hasAssociation, hasAttribute) and a data
property to represent the values of attributes (hasValue).



http://cagrid-portal.nci.nih.gov

By a high-level query, we mean a query that can be written without specific details
about the structure of the target resource.

By a descriptive query, we refer to queries that provide the criteria for the desired
data rather than the procedure to find the data.

We will see later that the queries could also use elements from the list ontology.

Some simple example queries are: a) Specimen, to retrieve all the objects that are
annotated with the Specimen concept; b) Gene and hasAttribute some (GeneSymbol and
hasValue value "BRCA%"), to find all the genes whose symbol starts with the string BRCA;
c) Single_Nucleotide_Polymorphism and hasAssociation some (Gene and hasAttribute some
(GeneSymbol and hasValue value "TGFB1")), to obtain all the SNPs associated with the
gene Transforming Growth Factor Beta 1. In our system, these queries could be submitted
to any data service, and they will be converted to the specific CQL query.

We note that the third query specifies SNPs that are associated with genes. This
association may be present in different ways in two separate UML models. For example,
the two corresponding classes may have a direct UML association, or the association may
arise by traversing an association path from the first class to the second one. In order
for our system to deal with those paths of associations, without requiring the user to
know the specific underlying UML model, we define the hasAssociation property as
transitive and use reasoning to determine the paths.

Next, we introduce our transformation from caGrid models to an OWL2 representation and
the query rewriting/translation approach, which transforms ontology-based queries into
CQL queries. The OWL2 ontologies provide a unified view of the UML models and their
semantic annotations, which allows us to apply reasoning over them.



4.1 OWL Representation of caGrid Information Models

OWL model of UML class diagrams. First, we present our customised UML-to-OWL
transformation. This transformation differs from previous approaches, as explained in
Section 5. Next, we describe the transformation and use the portion of the caBIO 4.2
information model in Figure 2 to give examples. Every UML element is related to its
counterpart in the UML model ontology: all UML classes and attributes are defined as
subclasses of UMLClass and UMLAttribute, respectively (see Eq. 1 and Eq. 2 below); all
the UML associations are sub-properties of hasAssociation (Eq. 4), and the datatype
property hasValue is used to specify the type of the attributes (Eq. 3) as an
existential restriction. Contrary to other UML-to-OWL transformations, we represent UML
attributes as OWL classes. This is required so that the ontology-based queries can
include the concepts associated with attributes.

Fig. 2: Part of UML class diagram for caBIO 4.2

The example queries are given in Manchester OWL Syntax and are just intended to show
queries retrieving objects from a UML class, a UML class with a condition over an
attribute and two associated UML classes with a restriction over an attribute of one of
the classes, respectively.

The prefixes used in the equations are: c: for the caBIO 4.2 ontology, u: for the UML
model ontology, n: for the NCIt ontology and l: for the list ontology. We note that the
name of an OWL class corresponding to an attribute includes the class name to avoid
duplications and, for associations, it includes its domain and range.

c:Chromosome ⊑ u:UMLClass (1)
c:Chromosome_number ⊑ u:UMLAttribute (2)
c:Chromosome_number ⊑ ∃ u:hasValue.xsd:string (3)
c:Chromosome_locationCollection_Location ⊑ u:hasAssociation (4)

UML subclass and superclass relationships are represented with subsumption (Eq. 5). For
each UML class, existential restrictions are added for its associations (Eq. 6) and
attributes (Eq. 7). While UML does not explicitly represent inherited associations, our
OWL representation makes them explicit, modelling the semantics of UML. For example, as
the UML class Location has an association chromosome with the class Chromosome, this
association is inherited by the subclass CytogeneticLocation (Eq. 8).

c:CytogeneticLocation ⊑ c:Location (5)
c:Chromosome ⊑ ∃ c:Chromosome_locationCollection_Location.c:Location (6)
c:Chromosome ⊑ ∃ u:hasAttribute.c:Chromosome_number (7)
c:CytogeneticLocation ⊑ ∃ c:Location_chromosome_Chromosome.c:Chromosome (8)
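The transformation rules above can be sketched as a small generator of axioms. This is
an illustrative sketch only, not the authors' implementation: the dictionary layout of
the UML model and the function names are assumptions, and axioms are emitted as
Manchester-style strings rather than through an OWL library. The class, attribute and
association names are those used in Eq. 1 to Eq. 8.

```python
def lineage(model, cls):
    """Yield cls followed by its ancestors, nearest first."""
    while cls is not None:
        yield cls
        cls = model[cls].get("parent")


def axioms(model, prefix="c:"):
    """Emit axiom strings for a UML model given as a dict (hypothetical layout)."""
    out = []
    for cls, spec in model.items():
        out.append(f"{prefix}{cls} SubClassOf u:UMLClass")
        if spec.get("parent"):
            out.append(f"{prefix}{cls} SubClassOf {prefix}{spec['parent']}")
        for attr in spec.get("attributes", ()):
            # Attributes become OWL classes named <Class>_<attribute>.
            attr_cls = f"{prefix}{cls}_{attr}"
            out.append(f"{attr_cls} SubClassOf u:UMLAttribute")
            out.append(f"{prefix}{cls} SubClassOf u:hasAttribute some {attr_cls}")
        # Associations declared on any ancestor are made explicit on the subclass.
        for owner in lineage(model, cls):
            for role, target in model[owner].get("associations", ()):
                prop = f"{prefix}{owner}_{role}_{target}"
                if owner == cls:
                    out.append(f"{prop} SubPropertyOf u:hasAssociation")
                out.append(f"{prefix}{cls} SubClassOf {prop} some {prefix}{target}")
    return out


model = {
    "Chromosome": {"attributes": ["number"],
                   "associations": [("locationCollection", "Location")]},
    "Location": {"associations": [("chromosome", "Chromosome")]},
    "CytogeneticLocation": {"parent": "Location"},
}
for axiom in axioms(model):
    print(axiom)
```

The loop over `lineage` is what makes the inherited association of Eq. 8 explicit on
CytogeneticLocation, even though the UML model only declares it on Location.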

We note that the generated OWL ontologies belong to OWL2 EL [7], an OWL2 profile
specifically designed to allow for efficient reasoning with large terminologies, which
is polynomial in the size of the ontology. While OWL2 EL disallows universal
quantification on properties, it does allow the inclusion of transitive properties.
Thus, it is suitable for our UML-to-OWL transformation customised for the rewriting
approach as outlined before.

OWL Representation of the Semantic Annotations. Apart from representing the UML model,
we also model its mapping to NCIt, as maintained in caDSR. Through the CDEs, UML
elements are annotated with a primary concept, which indicates the meaning of the
element. In turn, a list of qualifier concepts may be used to modify the primary
concept, giving it a more specific meaning. As OWL2 does not natively support the
representation of lists, we used Drummond et al.'s design pattern for sequences [5] to
model primary concepts and qualifier lists. The following equations give some examples
of how the semantic annotations of UML classes (Eq. 9) and attributes (Eq. 10) with a
single concept are modelled. Equation 11 models the class c:SNPCytogeneticLocation as
being a n:Location qualified with n:Chromosome_Band and
n:Single_Nucleotide_Polymorphism.

c:Chromosome ⊑ n:Chromosome (9)
c:Chromosome_number ⊑ n:Name (10)
c:SNPCytogeneticLocation ⊑ n:Location ⊓ (l:OWLList ⊓
    ∃ l:hasContents.n:Chromosome_Band ⊓
    ∃ l:hasNext.(l:OWLList ⊓
        ∃ l:hasContents.n:Single_Nucleotide_Polymorphism)) (11)
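The nesting used for qualifier lists can be made concrete with a short recursive
builder. This is a sketch under stated assumptions: the function names are invented, and
the class expression is produced as a Manchester-style string using the l:OWLList,
l:hasContents and l:hasNext names from the equations, with one list cell per qualifier.

```python
def owl_list(concepts):
    """Nest qualifier concepts with the hasContents / hasNext sequence pattern."""
    head, *rest = concepts
    expr = f"l:OWLList and (l:hasContents some {head})"
    if rest:
        # The remaining qualifiers form the tail of the list.
        expr += f" and (l:hasNext some ({owl_list(rest)}))"
    return expr


def annotation(primary, qualifiers=()):
    """Combine a primary concept with an optional ordered list of qualifiers."""
    if not qualifiers:
        return primary
    return f"{primary} and ({owl_list(list(qualifiers))})"


print(annotation("n:Location",
                 ["n:Chromosome_Band", "n:Single_Nucleotide_Polymorphism"]))
```

A class annotated with a single concept, as in Eq. 9 and Eq. 10, reduces to the primary
concept alone; each additional qualifier adds one level of hasNext nesting.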
Module Extraction from NCI Thesaurus Ontology. The NCIt ontology is very large, as it
provides a common vocabulary for the whole cancer domain [2]. Each caGrid data service
is, in general, concerned with data pertaining to a more specific domain than the whole
NCIt ontology. Thus, for each caGrid data service referring to a subset Σ of the NCIt
vocabulary, there is a subset of terms and relationships from NCIt that is relevant,
called a module of the ontology [5]. The module M represents all knowledge about the
terms of the signature Σ. One of the approaches to relevance is logic-based: the module
M is relevant for the terms Σ if all the consequences of the ontology that can be
expressed over Σ are also consequences of M [5]. We follow that approach by Sattler et
al. [8] and extract an NCIt module for each of the information models in caGrid. For
succinctness and efficiency, this module is used, as opposed to the whole NCIt ontology,
for the semantic annotations of UML models and subsequent reasoning. However, we removed
the disjointness axioms from the NCIt modules: as we noted before [5], using subsumption
to represent the mapping from UML classes to concepts may result in inconsistent
ontologies, as the annotations for a single class may come from two high-level branches
in NCIt that are declared as disjoint.



4.2 Query Rewriting and Translation

This section describes how an ontology-based query is rewritten and then translated,
first to an intermediate optimisation language and then to the target CQL language.
While the overall approach is similar to our previous work [3], previously we relied
completely on justifications [10]; we have now extensively improved the approach by
dealing with each of the steps independently. We provide the output of each step for the
third query from Section 4.

Parsing. First, the user query is syntactically parsed. The query uses concepts from the
NCIt, the UML model (see Section 4.1) and the list ontologies.

UML Extraction. The NCIt concepts in the query are translated into specific UML classes,
by reasoning over the generated ontologies. Each concept is the super-class of a UML
class or UML attribute, depending on its position in the query. Often, a single NCIt
concept will correspond to many UML classes (or attributes) and, in such cases, each UML
class is returned to form an individual query. Therefore, the outcome of the UML
extraction is a combination of possible queries given the extracted UML classes or
attributes. The outcome for our example query is: c:SNP and hasAssociation some (c:Gene
and hasAttribute some (c:Gene_symbol and hasValue value "TGFB1")).
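The "combination of possible queries" produced by UML extraction amounts to a Cartesian
product over the candidate UML names of each concept. The sketch below illustrates only
that combinatorial step, under the assumption that the candidates have already been
obtained (in the system described here they come from reasoning over the generated
ontologies); the template slots and the function name are hypothetical.

```python
from itertools import product


def expand(template, candidates):
    """Return one query per combination of UML names standing in for the concepts."""
    slots = sorted(candidates)
    queries = []
    for combo in product(*(candidates[slot] for slot in slots)):
        queries.append(template.format(**dict(zip(slots, combo))))
    return queries


# Slot names are placeholders for the NCIt concepts of the third example query.
template = ('{snp} and hasAssociation some ({gene} and hasAttribute some '
            '({symbol} and hasValue value "TGFB1"))')
candidates = {"snp": ["c:SNP"], "gene": ["c:Gene"], "symbol": ["c:Gene_symbol"]}
for query in expand(template, candidates):
    print(query)
```

With one candidate per concept, as in the example, a single query results; a concept
mapped to k UML classes multiplies the number of candidate queries by k.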

Data Values Extraction. As the generated ontologies do not contain instances, the
semantic validation of the query, expressed as an OWL class expression, must ignore the
data expressions. This step extracts the data expressions, which will be reinserted
later on. This step results in c:SNP and hasAssociation some (c:Gene and hasAttribute
some (c:Gene_symbol)).
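Extracting the data expressions and putting them back later can be sketched with plain
string processing. This is a simplified illustration, not the authors' parser: it
assumes queries are Manchester-syntax strings in which a data expression always has the
form `<class> and hasValue value "<literal>"`, and the function names are invented.

```python
import re

# A class name followed by a hasValue restriction on a quoted literal.
VALUE = re.compile(r'([\w:]+)\s+and\s+hasValue\s+value\s+("[^"]*")')


def extract_values(query):
    """Remove data expressions; return the bare query and the literals by class."""
    values = {}

    def stash(match):
        values[match.group(1)] = match.group(2)
        return match.group(1)

    return VALUE.sub(stash, query), values


def restore_values(query, values):
    """Re-attach each stored literal to every occurrence of its owning class."""
    for owner, literal in values.items():
        pattern = rf'(?<![\w:]){re.escape(owner)}(?![\w:])'
        query = re.sub(pattern,
                       lambda m: f'{owner} and hasValue value {literal}', query)
    return query


query = ('c:SNP and hasAssociation some (c:Gene and hasAttribute some '
         '(c:Gene_symbol and hasValue value "TGFB1"))')
bare, values = extract_values(query)
print(bare)
print(restore_values(bare, values))
```

Keying the literals by class name means the re-insertion still works after the query
has been rewritten in between, provided each attribute class occurs only once.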

Semantic Validation. We use a reasoner to check that the resulting query can be
satisfied. If the query cannot be satisfied, subsequent rewriting of the query is
halted.

Properties Path Finder. This step deals with the ontology corresponding to the UML model
(the semantic annotations do not need to be considered any longer) and aims at finding
the path of UML classes related through the transitive property hasAssociation. The path
finder rewrites the expression using non-transitive properties, corresponding to UML
associations, by using an explanation generator [10] that retrieves the justification
for two classes to be connected via the transitive property, which allows the
intermediate classes to be found. The path finder may find more than one path between a
set of nodes and, in such cases, will return each path as a combination of possible
queries for user selection. One path for our example query is: c:SNP and hasAssociation
some c:GeneRelativeLocation and hasAssociation some (c:Gene and hasAttribute some
(c:Gene_symbol)).
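The effect of the path finder can be illustrated on a plain graph of UML associations.
Note that this sketch deliberately swaps in a different technique: the system described
here recovers the paths from reasoner justifications, whereas the code below enumerates
cycle-free paths with a depth-first search. The dictionary layout and function name are
assumptions; the class and role names are those of the running example.

```python
def simple_paths(assocs, source, target, _visited=None):
    """Yield every cycle-free path as a list of (role_name, class_name) steps."""
    visited = (_visited or frozenset()) | {source}
    for role, nxt in assocs.get(source, ()):
        if nxt in visited:
            continue  # never revisit a class on the current path
        if nxt == target:
            yield [(role, nxt)]
        else:
            for rest in simple_paths(assocs, nxt, target, visited):
                yield [(role, nxt)] + rest


# Direct (non-transitive) UML associations: class -> [(role name, target class)].
assocs = {
    "SNP": [("relativeLocationCollection", "GeneRelativeLocation")],
    "GeneRelativeLocation": [("gene", "Gene")],
}
for path in simple_paths(assocs, "SNP", "Gene"):
    print(path)
```

When several paths connect the same pair of classes, the generator yields each of them,
which mirrors the case in which the user is asked to select among candidate queries.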

Data Values Addition. At this point, we can retrieve the data expressions removed
earlier and re-insert them into the corresponding OWL classes, resulting in c:SNP and
hasAssociation some c:GeneRelativeLocation and hasAssociation some (c:Gene and
hasAttribute some (c:Gene_symbol and hasValue value "TGFB1")).
OWL Expression to MCC Translation. No calculus or algebra has been defined for the
object-oriented query language CQL. To provide a translation with CQL as the target
language, we use the monoid comprehension calculus (MCC), as it is a formal framework to
support object query optimizations [11]. Object queries involve collections (e.g. sets,
lists, bags), whose semantics can be captured by monoid comprehensions (MCs). In this
paper, we only give an overview of MCs and their use in our system. Our approach is
similar to the work by Peim et al., but while they use an expansion algorithm to rewrite
an OWL expression based on an acyclic set of definitions, we follow the specific steps
described above.

An MC takes the form ⊕{e [] q}, where ⊕ is a monoid operator called the accumulator, e
is the header and q = q1, ..., qn, n ≥ 0, is a sequence of qualifiers. A qualifier can
take the form of a generator, v ← e', with v a range variable and e' an expression
constructing a collection, or a filter predicate. The symbol ⊎ denotes the accumulator
for bags. For an OWL class expression from the previous step, an MCC expression is built
such that the header variable is determined by the first concept in the query and the
qualifiers are built for each of the remaining expressions. The MCC expression for our
example is: ⊎{ s [] s ← SNP, r ← s.relativeLocationCollection, r ← GeneRelativeLocation,
g ← r.gene, g ← Gene, g.symbol = TGFB1 }
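How a monoid comprehension is evaluated can be shown with a small interpreter. This is a
minimal sketch for intuition only: the encoding of qualifiers as tuples, the function
name and the sample SNP records are assumptions made for the example, and bags are
modelled with Python's Counter.

```python
from collections import Counter
from functools import reduce

# A monoid is modelled as (merge, zero, unit); bags are represented by Counter.
BAG = (lambda a, b: a + b, Counter(), lambda x: Counter([x]))
SUM = (lambda a, b: a + b, 0, lambda x: x)


def comprehension(monoid, header, qualifiers, env=None):
    """Evaluate monoid{ header [] qualifiers }.

    A qualifier is ("gen", name, fn), where fn(env) returns a collection, or
    ("filter", fn), where fn(env) returns a bool.
    """
    merge, zero, unit = monoid
    env = dict(env or {})
    if not qualifiers:
        return unit(header(env))
    first, rest = qualifiers[0], qualifiers[1:]
    if first[0] == "filter":
        return comprehension(monoid, header, rest, env) if first[1](env) else zero
    _, name, source = first
    # Bind the range variable to each element and merge the partial results.
    parts = (comprehension(monoid, header, rest, {**env, name: item})
             for item in source(env))
    return reduce(merge, parts, zero)


# Invented sample data standing in for SNP objects and their associated genes.
snps = [{"id": "snp-1", "genes": [{"symbol": "TGFB1"}]},
        {"id": "snp-2", "genes": [{"symbol": "BRCA1"}]}]
matches = comprehension(
    BAG,
    lambda env: env["s"]["id"],
    [("gen", "s", lambda env: snps),
     ("gen", "g", lambda env: env["s"]["genes"]),
     ("filter", lambda env: env["g"]["symbol"] == "TGFB1")],
)
print(matches)
```

Changing only the monoid changes the kind of result: the same qualifiers evaluated with
SUM and a numeric header would produce a total instead of a bag.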



We note that the ontology is compliant with the OWL2 EL profile, as OWL2 EL supports the
use of transitive object properties. For more information, see
http://www.w3.org/TR/owl2-profiles/

For more details, we refer the reader to [3].

A monoid of type T is an algebraic structure defined by (⊕, Z⊕), where ⊕ : T × T → T is
an associative function and Z⊕ is the left and right identity of ⊕. A collection monoid
is a monoid for a collection type (e.g. lists or bags) and must also specify a unit
function building a singleton collection.

For example, ⊎{x [] x ← {1,2}} is the monoid comprehension representing the bag
{{1,2}}.

MCC to CQL Translation. Translating the MCC expression into CQL amounts to: defining as
Target the type of the variable that appears in the header; including an Association for
each pair of generators, one determining the name (the class to which they belong) and
the other identifying the role name; and including an Attribute restriction for each
filter. As this last step is the only one involving CQL, it is the only step that needs
to be modified to extend our methodology to other model-driven architectures with a
different target language. The resulting CQL in the example is:

<ns1:CQLQuery xmlns:ns1="http://CQL.caBIG/1/gov.nih.nci.cagrid.CQLQuery">
  <ns1:Target name="gov.nih.nci.cabio.domain.SNP">
    <ns1:Association name="gov.nih.nci.cabio.domain.GeneRelativeLocation"
                     roleName="relativeLocationCollection">
      <ns1:Association name="gov.nih.nci.cabio.domain.Gene" roleName="gene">
        <ns1:Attribute name="symbol" predicate="EQUAL_TO" value="TGFB1"/>
      </ns1:Association>
    </ns1:Association>
  </ns1:Target>
</ns1:CQLQuery>
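The final translation step can be sketched with Python's standard XML library. This is
an illustration rather than the implementation described in this paper (which is written
in Java): the function name and its argument layout are assumptions, while the
namespace, element and attribute names are taken from the CQL document above. The
serialiser chooses its own namespace prefix, so the output is equivalent to the example
but not byte-identical.

```python
import xml.etree.ElementTree as ET

CQL_NS = "http://CQL.caBIG/1/gov.nih.nci.cagrid.CQLQuery"


def to_cql(target, steps, attribute=None):
    """Nest one Association per (class_name, role_name) step under the Target.

    `attribute` is an optional (name, predicate, value) filter attached to the
    innermost class of the path.
    """
    root = ET.Element(f"{{{CQL_NS}}}CQLQuery")
    node = ET.SubElement(root, f"{{{CQL_NS}}}Target", name=target)
    for class_name, role_name in steps:
        node = ET.SubElement(node, f"{{{CQL_NS}}}Association",
                             name=class_name, roleName=role_name)
    if attribute is not None:
        name, predicate, value = attribute
        ET.SubElement(node, f"{{{CQL_NS}}}Attribute",
                      name=name, predicate=predicate, value=value)
    return ET.tostring(root, encoding="unicode")


xml_text = to_cql(
    "gov.nih.nci.cabio.domain.SNP",
    [("gov.nih.nci.cabio.domain.GeneRelativeLocation", "relativeLocationCollection"),
     ("gov.nih.nci.cabio.domain.Gene", "gene")],
    ("symbol", "EQUAL_TO", "TGFB1"),
)
print(xml_text)
```

Because only this function knows about CQL, replacing it is enough to target a different
query language, in line with the observation that the last step is the only
caGrid-specific one.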



4.3 Implementation 

We have implemented two modules with the following functionalities: a) an OWL generator,
which transforms a caGrid annotated UML model into an OWL ontology and includes the
generation of a module from the NCIt containing the concepts relevant to the UML model;
b) a query translation component, which takes as input an OWL class expression using
concepts from the NCI thesaurus and transforms it into a CQL query for a single data
service.

For the first module, we also produced a caGrid analytical service called OWLGenService,
which provides a simple API for the extraction of modules from NCIt and for the ontology
generation, given a specific information model.

The implementation was done in Java and uses caGrid version l.^j^ the 
OWL API version 3.1.Cp^ (after upgrading from OWL API version 2), and relies 
on the reasoners Pellet 2. 2. £3 and HermiT 1.3.(0 



4.4 Performance Evaluation 



This section analyses the generated ontologies and presents two areas of performance
evaluation that verify the viability of our approach. Since one important step in the
query rewriting/translation process is the property path finder (see Section 4.2), we
first introduce some metrics to assess the paths in the generated ontologies. These
paths are sequences of concepts linked by object properties. Secondly, we present the
generation times for the module extraction, the ontology generation and the inference of
the ontologies using both the Pellet and HermiT reasoners. These results show that the
ontologies that make our approach possible are generated in a timely fashion. Thirdly,
we evaluate the performance of the query rewriting process, showing a breakdown of the
constituent parts of the rewriting algorithm. For this evaluation, we considered two
sets of five queries each, run over the caBIO data service, where each set consists of
queries that involve paths of lengths one and two. The tests were run on a Red Hat
Enterprise Linux Server release 5.3 (Tikanga) with 64 bits and 48285 MB of RAM.


Throughout this section, we have grouped caGrid projects into three distinct subsets:
caDSR (projects that are available from the caDSR service), caGrid (all data services
that are registered with the caGrid default index service) and Information Models, or
InfoModels (those models that are supported by a deployed service from the caGrid index
service). We note that the groups caGrid and InfoModels are the most relevant for our
system, as CQL queries can only be executed against these projects. While InfoModels
include a single project from caDSR for a set of deployed services corresponding to that
project, caGrid may include the results for several services that correspond to a single
model. Thus, the caGrid results will be skewed according to the relative weight of
services as opposed to models.

Analysis of the OWL representation of the information models. 

While ontology metrics have been defined in several tools [13], these have
focused on basic metrics (e.g. number of classes) and semantic-based metrics
(e.g. relationship richness) that allow for the comparison and quality
evaluation of the ontologies. Here, we focus on some bespoke metrics that we
developed to measure the proliferation and complexity of paths within the
ontologies, as these underpin the viability of our approach.

http://cabiogrid42.nci.nih.gov:80/wsrf/services/cagrid/CaBIO42GridSvc
http://cagrid-index.nci.nih.gov:8080/wsrf/services/DefaultIndexService

It should be noted that not all caDSR projects are included in the metrics; some
contained errors (their semantic metadata is not complete or refers to an older
version of the NCI Thesaurus) and some models are targeted at data modelling,
rather than specifically holding data, making them unrepresentative for our
system. Out of the 136 projects in caDSR, 16 were excluded from the analysis
for these reasons. However, none of the excluded projects had an associated
service. Additionally, the caGrid subset has 63 services and InfoModels has 23
projects.

As seen in Section 4.2, our rewriting process seeks to remove the upper-level,
transitive object property hasAssociation and express the query using only
non-transitive properties, which correspond to the UML associations in the
models. To achieve this, we consider the paths between pairs of concepts from
the query that are connected through the hasAssociation property. The
calculation of these paths is not trivial: there may be many intermediate
nodes, and there may be more than one path for a given pair of concepts. We
define a journey as a traversal from one concept to another. A journey may have
one or many paths, which represent the possible routes that the traversal can
take. Thus, it is important to evaluate these aspects of the ontologies in
order to assess the viability of a rewriting tool.

We propose the following metrics as a measure of complexity in this respect.
The Longest Path is the maximum path length that may be computed within a
given ontology. It provides an indication of the worst case for path
calculation times. The Average Paths per Journey reflects the degree of path
expansion within the rewriting algorithm, as each journey (e.g. from Node A to
Node B) may have many different paths. The rewriting algorithm should return
all possible paths, as each path may correspond to a different expression of
the query. When we consider that a single query may include multiple
independent journeys, the number of possible query rewritings can become very
large. The Average Nodes per Path is the average number of nodes that must be
visited in order to return a single path. These metrics can affect the path
calculation time as well as the complexity of the resulting query.
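As an illustration of how these three metrics relate to one another, the following is a minimal sketch over a toy directed graph. It is not the path finder of Section 4.2, which works on the generated ontologies: here a plain depth-first search stands in for it, the concept names and the `ASSOCIATIONS` mapping are hypothetical, and path length is assumed to be counted in nodes, endpoints included.

```python
from statistics import mean

# Hypothetical toy model: each concept maps to the concepts it reaches through
# a single non-transitive object property (one UML association).
ASSOCIATIONS = {
    "Gene": ["Protein", "Pathway"],
    "Protein": ["Pathway"],
    "Pathway": ["Disease"],
    "Disease": [],
}


def simple_paths(graph, source, target, visited=None):
    """Yield every cycle-free path from source to target as a list of nodes."""
    visited = (visited or []) + [source]
    for neighbour in graph.get(source, []):
        if neighbour == target:
            yield visited + [target]
        elif neighbour not in visited:
            yield from simple_paths(graph, neighbour, target, visited)


def path_metrics(graph):
    """Compute the three path metrics over every journey that has a path."""
    journeys = {}
    for source in graph:
        for target in graph:
            if source != target:
                paths = list(simple_paths(graph, source, target))
                if paths:
                    journeys[(source, target)] = paths
    all_paths = [path for paths in journeys.values() for path in paths]
    return {
        # Assumption: length is measured in nodes, endpoints included.
        "longest_path": max(len(path) for path in all_paths),
        "avg_paths_per_journey": mean(len(paths) for paths in journeys.values()),
        "avg_nodes_per_path": mean(len(path) for path in all_paths),
    }


metrics = path_metrics(ASSOCIATIONS)
```

In this toy graph, the journey from Gene to Disease has two paths (directly through Pathway, or through Protein and then Pathway), which is the kind of path expansion that the Average Paths per Journey metric captures.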




[Figure: three box-and-whisker plots, each over the caDSR, caGrid and InfoModels subsets]

Fig. 3: The Path Metrics.

Figure 3 shows three box-and-whisker plots with the results of the path
metrics for each project subset. We observe that, while the longest path can
have up to 36 nodes, for 75% of the projects in each category it has fewer
than 17 or 18 nodes. The median of the average path length varies between 4
and 7 nodes over the three subsets, and for 75% of the InfoModels the average
path length is less than 8. The median of the average paths per journey is
around 2, and for 75% of the projects in each category the average is less
than 2.5. This indicates that we will be returning a low number of path
combinations as a result. These results, then, verify that the paths within
the ontologies are manageable and appropriate for our rewriting tool. We also
note that in all the metric diagrams, the caGrid subset is often very densely
clustered around the mean. This is because there are often many caGrid
services for the same project that differ from one another very slightly or
not at all, which can result in multiple similar or identical results.
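To make the link between paths per journey and the number of path combinations concrete, the sketch below assumes that the journeys of a query are independent, so that a candidate rewriting picks exactly one path per journey and the candidates multiply. The journeys and paths shown are hypothetical and are not taken from any caGrid model.

```python
from itertools import product
from math import prod

# Hypothetical journeys of one query, each with its candidate paths.
journey_paths = {
    ("Gene", "Disease"): [
        ["Gene", "Pathway", "Disease"],
        ["Gene", "Protein", "Pathway", "Disease"],
    ],
    ("Gene", "Protein"): [["Gene", "Protein"]],
    ("Gene", "Pathway"): [
        ["Gene", "Pathway"],
        ["Gene", "Protein", "Pathway"],
    ],
}


def candidate_rewritings(paths_by_journey):
    """Return one tuple of paths per candidate rewriting (one path per journey)."""
    return list(product(*paths_by_journey.values()))


def rewriting_count(paths_by_journey):
    """Count the candidates without enumerating them."""
    return prod(len(paths) for paths in paths_by_journey.values())


candidates = candidate_rewritings(journey_paths)
```

Under this assumption, a query with k journeys and about two paths per journey yields on the order of 2^k candidates, which is why a low Average Paths per Journey keeps the rewriting manageable.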

Ontology Generation, Module Extraction and Classification. In order to
isolate any overhead caused by variations in network performance, we
extracted the XML corresponding to each project (or information model) in
caDSR. This is a preliminary step that allows the performance evaluator to run
locally, and we do not include any data about the performance of this stage.
We generate four ontologies for each project: the NCIt module ontology
(incorporating the concepts from NCIt relevant to the project), the annotated
UML ontology (including the classes describing the UML model) and inferred
versions of the two ontologies. We recorded the time for each generation, and
Figure 4 shows these times for the four ontologies per project, grouped by
subset. The times are presented on a logarithmic scale. We can see that most
(75%) of the NCIt modules take less than 2 seconds to generate, and the
ontology generation takes even less time. The classification of the generated
ontologies is also fast, with the average inference time of the Pellet and
HermiT reasoners never exceeding 100 milliseconds.




[Figure: two panels, each over the caDSR, caGrid and InfoModels subsets]

Fig. 4: The generation and inference times.



Query Rewriting Evaluation. We have developed a suite of queries of
varying complexity in order to evaluate the query rewriting. The results are
presented in Figure 5, which shows the average time taken at each stage of the
query rewriting process for five queries whose rewriting has path length one,
five queries whose rewriting has path length two, and the mean times for these
two sets. A query's path length refers to the number of intermediate nodes in
the rewritten query. We can see from Figure 5 that, while the path length has
a marked effect on the time taken at the path finding stage, the other stages
remain largely unaffected. We therefore maintain that, given our analysis of
paths within our target ontologies described above, we can provide query
rewriting in a timely and efficient manner.

We generate the inferred ontologies by classifying the generated ontologies
using both the HermiT and Pellet reasoners.

Each query was run 5 times and the average times were calculated.
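The per-stage breakdown can be collected with a small timing harness such as the sketch below. It is an assumption about how such measurements might be taken and not the evaluation code used here: the stage names follow the stages of query rewriting (parsing, UML extraction, validation, path finding, MCC conversion and CQL conversion), the identity functions are placeholders for the real components, and `time_stages` is a hypothetical helper.

```python
import time
from statistics import mean

STAGES = [
    "parsing",
    "UML extraction",
    "validation",
    "path finding",
    "MCC conversion",
    "CQL conversion",
]


def time_stages(stage_functions, query, runs=5):
    """Run the pipeline `runs` times and return the mean seconds per stage."""
    samples = {name: [] for name, _ in stage_functions}
    for _ in range(runs):
        value = query
        for name, function in stage_functions:
            start = time.perf_counter()
            value = function(value)  # each stage consumes the previous output
            samples[name].append(time.perf_counter() - start)
    return {name: mean(times) for name, times in samples.items()}


# Placeholder stages: identity functions standing in for the real components.
pipeline = [(name, lambda value: value) for name in STAGES]
averages = time_stages(pipeline, "hypothetical query text")
```

Timing each stage separately, instead of the pipeline as a whole, is what allows the effect of path length to be attributed to the path finding stage alone.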



[Figure: three panels titled Path Length 1, Path Length 2 and Mean Query Time, each with the rewriting stage on the horizontal axis]

Fig. 5: Query rewriting results at varying path lengths.



5 Related Work 

The UML-to-OWL transformation has been studied in different contexts and
applications, varying from the detection of inconsistencies in UML diagrams to
their use as interchangeable modelling artifacts [14,15]. We have also
provided alternative transformations before [5,15] and have improved them here
so that the UML transformation conforms to the OWL2 EL profile, where the
semantic annotations use subsumption and, additionally, primary concepts and
qualifiers are modelled as sequences.

The use of semantic web technologies to support semantic queries over
distributed data environments in biomedicine has been implemented in systems
such as DartGrid [16] (for traditional Chinese medicine), ACGT [17] and
semCDI [15] (for cancer). To the best of our knowledge, the latter is the only
work over the caGrid infrastructure. All these systems support SPARQL queries
over the resources, while our system allows for high-level descriptive queries
that do not need to be based on the structure of particular resources.
Additionally, our approach of using MCC as an intermediary language provides
support for optimisations and generality, as a different target language could
be used, even SPARQL.



These correspond to the stages of query rewriting: parsing, UML extraction,
validation, path finding, MCC conversion and CQL conversion. For more
information, refer to Section 4.2.

6 Conclusions

This paper presented the design and implementation of an ontology-based
querying functionality over a service-oriented, model-driven infrastructure
aimed at sharing cancer research data. In particular, the implementation was
based on the caGrid infrastructure, but the approach could be used over
similar model-driven software infrastructures. We presented: a) the generation
of customised OWL2 ontologies from annotated UML models, based on the
ISO 11179 standard for metadata registries, which differs from traditional
UML-to-OWL conversions and improves on [3], mainly because we now generate
OWL2 EL ontologies for the UML models and support annotations with primary
concepts and qualifiers; b) an analysis of the generated ontologies by
determining several relevant and bespoke ontology metrics concerning paths and
their complexity, which justify the viability of our query
rewriting/translation technique; c) a caGrid analytical service implementing
the OWL generation facility; d) an analysis of the capabilities of the caGrid
query language and the queries it supports; e) a significant revision and
improvement of the query rewriting and translation steps that transform a
domain ontology-based query into CQL; f) an extensive performance evaluation
of the OWL generation and module extraction, plus an assessment of the query
rewriting and translation process and its viability. In future work, we will
extend the query rewriting/translation evaluation with a more varied query
set, explore the use of an OWL2 EL reasoner to improve the performance of the
path finding process, and support federated queries across data resources,
where the selection of join conditions will be provided by a semantic analysis
of the distributed resources.